Data cleaning

Quantitative Methodology (UPF)

Jordi Mas Elias

https://www.jordimas.cat/

Summary

  • Introduction
  • Tidy data
  • Variables
  • Recoding variables
  • Scope of data

Introduction

Introduction

“Data Scientists spend up to 80% of the time on data cleaning and 20% on actual data analysis”.

Source: R for Data Science

Introduction

Why do we need to clean data?

  • Data is not tidy.
  • Data is missing.
  • Data is many datasets.

Tidy data

Wickham rules

A df is tidy if it fulfills these requirements (Wickham 2014):

  • Each df has one unit of observation.
  • Observations represented in the rows.
  • Variables represented in the columns.
  • Each cell indicates a value.

Source: R for Data Science

A tidy dataframe

Intra-State War Data (Correlates of War)

# A tibble: 442 × 4
   WarName            WarType SideA          SideB                           
   <chr>                <dbl> <chr>          <chr>                           
 1 First Caucasus           5 Russia         Georgians, Dhagestania, Chechens
 2 Sidon-Damascus           6 Sidon          Damascus & Aleppo               
 3 First Two Sicilies       4 Austria        -8                              
 4 First Two Sicilies       4 Two Sicilies   Liberals                        
 5 Spanish Royalists        4 Spain          Royalists                       
 6 Sardinian Revolt         4 Austria        -8                              
 7 Sardinian Revolt         4 Sardinia       Carbonari                       
 8 Greek Independence       5 Ottoman Empire Greeks                          
 9 Greek Independence       5 -8             United Kingdom                  
10 Greek Independence       5 -8             France                          
# … with 432 more rows

Untidy data

More datasets, opening the tidyr:: package.

# A tibble: 18 × 11
   religion      `<$10k` $10-2…¹ $20-3…² $30-4…³ $40-5…⁴ $50-7…⁵ $75-1…⁶ $100-…⁷
   <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 Agnostic           27      34      60      81      76     137     122     109
 2 Atheist            12      27      37      52      35      70      73      59
 3 Buddhist           27      21      30      34      33      58      62      39
 4 Catholic          418     617     732     670     638    1116     949     792
 5 Don’t know/r…      15      14      15      11      10      35      21      17
 6 Evangelical …     575     869    1064     982     881    1486     949     723
 7 Hindu               1       9       7       9      11      34      47      48
 8 Historically…     228     244     236     238     197     223     131      81
 9 Jehovah's Wi…      20      27      24      24      21      30      15      11
10 Jewish             19      19      25      25      30      95      69      87
11 Mainline Prot     289     495     619     655     651    1107     939     753
12 Mormon             29      40      48      51      56     112      85      49
13 Muslim              6       7       9      10       9      23      16       8
14 Orthodox           13      17      23      32      32      47      38      42
15 Other Christ…       9       7      11      13      13      14      18      14
16 Other Faiths       20      33      40      46      49      63      46      40
17 Other World …       5       2       3       4       2       7       3       4
18 Unaffiliated      217     299     374     365     341     528     407     321
# … with 2 more variables: `>150k` <dbl>, `Don't know/refused` <dbl>, and
#   abbreviated variable names ¹​`$10-20k`, ²​`$20-30k`, ³​`$30-40k`, ⁴​`$40-50k`,
#   ⁵​`$50-75k`, ⁶​`$75-100k`, ⁷​`$100-150k`

Untidy data

Number of TB cases documented by the WHO in Afghanistan, Brazil, and China between 1999 and 2000 (cases & population).

Table A

# A tibble: 6 × 3
  country      year rate             
* <chr>       <int> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Table B

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Table C

# A tibble: 12 × 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Table D

# A tibble: 3 × 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

Pivoting data

We change the rows and columns of the dataframe keeping the same information.

Pivot wider

pivot_wider(df, names_from, values_from)

Pivot longer

pivot_longer(df, cols, names_to, names_to)

Separate

separate(df, cols, names_to, names_to)

Unite

unite(df, cols, names_to, names_to)

Exercise

?world_bank_pop

Pivoting dataframes

What does it mean?

Bullets

When you click the Render button a document will be generated that includes:

  • Content authored with markdown
  • Output from executable code

Code

When you click the Render button a presentation will be generated that includes both content and the output of embedded code. You can embed code like this:

[1] 2
Wickham, Hadley. 2014. Tidy Data.” Journal of Statistical Software 50 (10): 1–23.